Why is EDA so important?

Data can be very sneaky! Presumably everyone here has the same intuition: the data below are quite different from each other - right?


Meet the Datasaurus Dozen!

Meet the Datasaurus Dozen!


It turns out that if we only looked at commonly computed summary statistics -> we would think that they all are the same.


While different in appearance, each dataset has the same summary statistics (mean, standard deviation, and Pearson’s correlation) to two decimal places.

While different in appearance, each dataset has the same summary statistics (mean, standard deviation, and Pearson’s correlation) to two decimal places.


You REALLY need to intimately know your data! Here are seven distributions of data, shown as raw data points (or strip-plots), as box-plots, and as violin-plots.

You REALLY need to intimately know your data! Here are seven distributions of data, shown as raw data points (or strip-plots), as box-plots, and as violin-plots.

These fantastic visualizations and datasets are from here.


Learning objectives


  1. Intro to plotting data with ggplot2
  2. Data exploration - what your data “look” like?
    • Histograms and Density plots
    • Box-, violin-, and jitter-plots
    • Summary tables
  3. Comparing/contrasting data.
    • Correlation
    • PCA (principle components analysis)
    • Clustering

1. Intro to plotting data with ggplot2

R has many useful plotting functions that come installed (base functions), for example hist(), plot(), and barplot(). I typically use these functions on the fly while I’m coding to perform quick checks. However, they are a limited, a bit clunky, and far from aesthetically appealing for making convincing graphs/plots. A popular alternative is ggplot2. While graphs produced using ggplot2 defaults are (imo) not quite publication qualitym with a few tweaks you can pretty much get there! We will cover the following ggplot topics:


A. The basic syntax of ggplot()

ggplot(): build plots piece by piece The concept of ggplot divides a plot into three different fundamental parts:

plot = data + aesthetics + geometry.

data: a data frame
aesthetics: specify x and y variables, and other features - color, size, shape, etc.
geometry: specify type of plots - histogram, boxplot, line, density, dotplot, bar, etc.

B. Producing different types of plots

+ Histograms and density plots
+ Box-, violin-, and jitter-plots
+ Summary tables

C. Additional considerations (will return to in class #4)

+ Multiple plots
+ Themes
+ Titles, axes, and legends
+ Colors
+ Scales

2. Data exploration - what your data “look” like?

Ok let’s get this show on the road and import some data!!!

data_genelevel <- read_delim(here("2-class2",
                        "data",
                        "data_genelevel.tsv"),
                   delim = "\t",
                   escape_double = FALSE,
                   trim_ws = TRUE)
## Parsed with column specification:
## cols(
##   gene_symbol = col_character(),
##   hour00_rep1 = col_double(),
##   hour00_rep2 = col_double(),
##   hour00_rep3 = col_double(),
##   hour04_rep1 = col_double(),
##   hour04_rep2 = col_double(),
##   hour04_rep3 = col_double(),
##   hour08_rep1 = col_double(),
##   hour08_rep2 = col_double(),
##   hour08_rep3 = col_double(),
##   hour14_rep1 = col_double(),
##   hour14_rep2 = col_double(),
##   hour14_rep3 = col_double()
## )
lib_counts <- data.frame("lib"=colnames(data_genelevel[,-1]),
                         "counts"=colSums(data_genelevel[,-1])
                         )


summary(data_genelevel) # let's get a feel for the data
##  gene_symbol         hour00_rep1       hour00_rep2       hour00_rep3     
##  Length:10568       Min.   :    0.0   Min.   :    0.0   Min.   :    0.0  
##  Class :character   1st Qu.:   32.0   1st Qu.:   35.0   1st Qu.:   33.0  
##  Mode  :character   Median :   83.0   Median :   92.0   Median :   87.0  
##                     Mean   :  228.9   Mean   :  252.5   Mean   :  237.1  
##                     3rd Qu.:  201.0   3rd Qu.:  223.0   3rd Qu.:  206.2  
##                     Max.   :66502.4   Max.   :64463.6   Max.   :59309.1  
##   hour04_rep1       hour04_rep2       hour04_rep3       hour08_rep1     
##  Min.   :    0.0   Min.   :    0.0   Min.   :    0.0   Min.   :    0.0  
##  1st Qu.:   30.0   1st Qu.:   29.0   1st Qu.:   28.0   1st Qu.:   25.0  
##  Median :   82.0   Median :   80.0   Median :   74.0   Median :   74.0  
##  Mean   :  231.1   Mean   :  223.6   Mean   :  208.2   Mean   :  228.2  
##  3rd Qu.:  200.0   3rd Qu.:  195.0   3rd Qu.:  183.0   3rd Qu.:  195.0  
##  Max.   :55940.5   Max.   :51483.9   Max.   :44883.6   Max.   :69241.1  
##   hour08_rep2       hour08_rep3        hour14_rep1     hour14_rep2      
##  Min.   :    0.0   Min.   :    0.00   Min.   :    0   Min.   :    0.00  
##  1st Qu.:   26.0   1st Qu.:   24.01   1st Qu.:   17   1st Qu.:   14.00  
##  Median :   78.0   Median :   73.08   Median :   61   Median :   50.99  
##  Mean   :  237.1   Mean   :  225.26   Mean   :  227   Mean   :  188.20  
##  3rd Qu.:  205.0   3rd Qu.:  195.00   3rd Qu.:  187   3rd Qu.:  153.00  
##  Max.   :63622.7   Max.   :65946.49   Max.   :73482   Max.   :62746.33  
##   hour14_rep3      
##  Min.   :    0.00  
##  1st Qu.:   17.00  
##  Median :   60.26  
##  Mean   :  221.28  
##  3rd Qu.:  183.00  
##  Max.   :69037.02
# histogram
ggplot(data = data_genelevel, aes(x = hour00_rep1)) +
  geom_histogram(bins = 50) +
  theme_few() +
  ggtitle("histogram")

# histogram log10-scale
ggplot(data = data_genelevel, aes(x = log10(hour00_rep1))) +
  geom_histogram(bins = 50) +
  theme_few() +
  ggtitle("histogram log10-scale")
## Warning: Removed 124 rows containing non-finite values (stat_bin).

# histogram log10-scale -> my preferred route
ggplot(data = data_genelevel, aes(x = hour00_rep1)) +
  geom_histogram(bins = 50) +
  scale_x_log10() +
  theme_few() +
  ggtitle("histogram log10-scale")
## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 124 rows containing non-finite values (stat_bin).

# histogram log10-scale with pseudocount
ggplot(data = data_genelevel, aes(x = hour00_rep1)) +
  geom_histogram(bins = 50) +
  scale_x_log10() +
  theme_few() +
  ggtitle("histogram log10-scale with pseudocount")
## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 124 rows containing non-finite values (stat_bin).

# density log10-scale with pseudocount
ggplot(data = data_genelevel, aes(x = hour00_rep1 + 1)) +
  geom_density(bins = 50) +
  scale_x_log10() +
  theme_few() +
  ggtitle("density log10-scale with pseudocount")
## Warning: Ignoring unknown parameters: bins

# scattter plot
ggplot(data = data_genelevel, aes(x = hour00_rep1, y = hour00_rep2)) +
  geom_point() + 
  theme_few() +
  ggtitle("scattter plot")

# scattter plot log10-scale with pseudocount
ggplot(data = data_genelevel, aes(x = hour00_rep1 + 1, y = hour00_rep2 + 1)) +
  geom_point() + 
  scale_x_log10() +
  scale_y_log10() +
  theme_few() +
  ggtitle("scattter plot log10-scale with pseudocount")

long_genelevel <- reshape2::melt(data_genelevel) # convert from wide-to-long
## Using gene_symbol as id variables
names(long_genelevel) <- c("gene_symbol","Sample", "Count") # rename columns

# density of all samples log10-scale with pseudocount
ggplot(long_genelevel, aes(x=Count + 1, color = Sample)) + 
  geom_density() +
  scale_x_log10() +
  theme_few() +
  theme(legend.position="right") +
  ggtitle("density samples log10-scale with pseudocount")

# boxplot of all samples log10-scale with pseudocount
ggplot(long_genelevel, aes(y=Count + 1, x = Sample, color = Sample)) + 
  geom_boxplot() +
  scale_y_log10() +
  theme_few() +
  theme(legend.position="right") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("boxplot samples log10-scale with pseudocount")

# violinplot of all samples log10-scale with pseudocount
ggplot(long_genelevel, aes(y=Count + 1, x = Sample, color = Sample)) + 
  geom_violin() +
  scale_y_log10() +
  theme_few() +
  theme(legend.position="right") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("violinplot samples log10-scale with pseudocount")

3. Comparing/contrasting data

Let’s ask how similar the samples are to each other

Correlation

gene_cor_pearson <- cor(x = log10(data_genelevel[,-1] + 1), method = "pearson")


gene_cor_spearman <- cor(x = data_genelevel[,-1], method = "spearman")

corrplot(corr = gene_cor_pearson,
         addCoefasPercent = T,
         addCoef.col = "white",
         number.cex = .8,
         diag = T
         )

corrplot(corr = gene_cor_pearson,
         addCoefasPercent = T,
         addCoef.col = "white",
         number.cex = .8,
         diag = T,
         # col = rev(viridis(20)),
         order = "hclust",
         addrect = 4,
         rect.col = "red",
         title = "Pearson"
         )

corrplot(corr = gene_cor_spearman,
         addCoefasPercent = T,
         addCoef.col = "white",
         number.cex = .8,
         diag = T,
         # col = rev(viridis(20)),
         order = "hclust",
         addrect = 4,
         rect.col = "red",
         title = "Spearman"
         )

corrplot(corr = gene_cor_pearson,
         type = "upper",
         col = rev(viridis(20)),
         addCoefasPercent = T,
         addCoef.col = "white",
         number.cex = .8,
         diag = F,
         title = "Pearson Upper Triangular"
         )

Clustering

We’ve already played around with clustering in the correlation plot. However, it is important and so lets do it from scratch. You can cluster any matrix. We will cluster the matrix produced by the pearson correlation.

pheatmap(mat = gene_cor_pearson, 
         border_color = "black",
         cluster_rows = T,
         cluster_cols = T )

pheatmap(mat = gene_cor_pearson,
         clustering_method = "ward.D2", 
         border_color = "black",
         cluster_rows = T,
         cluster_cols = T )

Principle components analysis (PCA)

PCA is a common dimensionality reduction method that is used to visualize the similarity and differences in your data. Please watch this fantastic 5 minute video explaining PCA

Statquest really is the best! Anyway, for more detailed explanations go here and here.

We will use the prcomp() function to perform PCA analysis. Please note that you should typically center and scale your data.

pcDF <- prcomp(log10(data_genelevel[,-1] + 1), center = T, scale. = T) 

summary(pcDF) # summarize the PCs by variance
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5    PC6
## Standard deviation     3.2719 0.9644 0.40217 0.24271 0.15577 0.1426
## Proportion of Variance 0.8921 0.0775 0.01348 0.00491 0.00202 0.0017
## Cumulative Proportion  0.8921 0.9696 0.98310 0.98801 0.99003 0.9917
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.13460 0.13373 0.13089 0.12853 0.12375 0.11953
## Proportion of Variance 0.00151 0.00149 0.00143 0.00138 0.00128 0.00119
## Cumulative Proportion  0.99324 0.99473 0.99616 0.99753 0.99881 1.00000
a <- summary(pcDF) # assing the summary to the variable 'a'
a$importance[2,1] # then we can pull specific numbers out of 'a'
## [1] 0.89212

Let’s say we want ask how different the samples are to each other using the variance accross all genes.

pcDFData <- data.frame(pcDF$rotation) # we make a dataframe out of the rotations and will use this to plot


ggplot(pcDFData, aes(x=PC2, y=PC1)) +
  geom_point(size=2) +
  geom_text_repel(label=rownames(pcDFData)) +
  theme_few() +
  labs(ylab(paste("PC1 (%",100*round(a$importance[2,1], digits = 3),")", sep = "")
)) +
  labs(xlab(paste("PC2 (%",100*round(a$importance[2,2], digits = 3),")", sep = "")
)) +
  ggtitle("PCA analysis") 

pcDFData$ID <- rownames(pcDFData)
pcDFData <- pcDFData %>% separate(col = ID, sep = "_", into = c("time","rep"))


ggplot(pcDFData, aes(x=PC2, y=PC1)) +
  geom_point(size=2, aes(shape = rep, color = time)) +
  geom_text_repel(label=rownames(pcDFData)) +
  scale_color_manual(values = rev(viridis(4, option="D", end = .8))) +
  theme_few() +
  labs(ylab(paste("PC1 (%",100*round(a$importance[2,1], digits = 3),")", sep = "")
)) +
  labs(xlab(paste("PC2 (%",100*round(a$importance[2,2], digits = 3),")", sep = "")
)) +
  ggtitle("PCA analysis") 

Actual biological questions

Is DUX4 induced? What happens to DUX4 target genes with DUX4 induction over time? Do they go up? Do they go down? Do they do anything special?

dux_targets <- c("AC010606.1", "ALPPL2", "C1DP2", "CCNA1", "DUXA", "HNRNPCL1", "KDM4E", "KHDC1L", "KHDC1L", "KHDC1P1", "KLF17", "LEUTX", "MBD3L2", "MBD3L3", "MBD3L4", "MBD3L5", "PRAMEF1", "PRAMEF10", "PRAMEF11", "PRAMEF12", "PRAMEF13", "PRAMEF14", "PRAMEF15", "PRAMEF17", "PRAMEF18", "PRAMEF19", "PRAMEF2", "PRAMEF3", "PRAMEF4", "PRAMEF5", "PRAMEF6", "PRAMEF7", "PRAMEF8", "PRAMEF9", "RFPL2", "RFPL4B", "RP11-432M8.11", "RP11-432M8.17", "RP11-432M8.9", "RP11-554D14.4", "RP13-221M14.1", "RP13-221M14.3", "RP13-221M14.5", "SLC34A2", "TPRX1", "TRIM43", "TRIM43B", "TRIM43CP", "TRIM49","TRIM49B", "TRIM49C", "TRIM49DP", "TRIM49L1", "TRIM51", "TRIM51BP", "TRIM51CP", "TRIM51EP", "TRIM53AP", "TRIM53BP", "TRIM53CP", "WI2-2994D6.1", "WI2-2994D6.2", "WI2-3308P17.2", "XX-FW84067D5.1", "ZNF280A", "ZNF705G", "ZSCAN4")


long_genelevel$target <- if_else(condition = long_genelevel$gene_symbol %in% dux_targets,
        true = "target",
        false = "not target"
          )

ggplot(data = long_genelevel  %>% filter(gene_symbol=="DUXA"), aes(x = Sample, y = Count)) +
  geom_point() +
  scale_y_log10() +
  theme_few() +
  theme(legend.position="right") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("log10 DUXA counts")

ggplot(data = long_genelevel, aes(x = Sample, y = Count, fill = target)) +
  geom_boxplot() +
  scale_y_log10() +
  theme_few() +
  theme(legend.position="right") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("DUX4 targets vs not targets")
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1251 rows containing non-finite values (stat_boxplot).

ggplot(data = long_genelevel, aes(x = Sample, y = Count, fill = target)) +
  geom_boxplot(outlier.shape=NA) +
  scale_y_log10() +
  theme_few() +
  theme(legend.position="right") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("DUX4 targets vs not targets")
## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Removed 1251 rows containing non-finite values (stat_boxplot).

Finally, we will always end our Rmarkdown documents with the sessing information: ***

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2    pheatmap_1.0.10   corrplot_0.84    
##  [4] here_0.1          viridis_0.5.1     viridisLite_0.3.0
##  [7] scales_1.0.0      gridExtra_2.3     reshape2_1.4.3   
## [10] ggrepel_0.8.0     ggthemes_4.0.0    forcats_0.3.0    
## [13] stringr_1.3.1     dplyr_0.7.6       purrr_0.2.5      
## [16] readr_1.1.1       tidyr_0.8.1       tibble_1.4.2     
## [19] ggplot2_3.0.0     tidyverse_1.2.1  
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.4   haven_1.1.2        lattice_0.20-35   
##  [4] colorspace_1.3-2   htmltools_0.3.6    yaml_2.2.0        
##  [7] rlang_0.2.2        pillar_1.3.0       glue_1.3.0        
## [10] withr_2.1.2        RColorBrewer_1.1-2 modelr_0.1.2      
## [13] readxl_1.1.0       bindr_0.1.1        plyr_1.8.4        
## [16] munsell_0.5.0      gtable_0.2.0       cellranger_1.1.0  
## [19] rvest_0.3.2        evaluate_0.11      labeling_0.3      
## [22] knitr_1.20         broom_0.5.0        Rcpp_0.12.18      
## [25] backports_1.1.2    jsonlite_1.5       hms_0.4.2         
## [28] digest_0.6.15      stringi_1.2.4      grid_3.5.1        
## [31] rprojroot_1.3-2    cli_1.0.0          tools_3.5.1       
## [34] magrittr_1.5       lazyeval_0.2.1     crayon_1.3.4      
## [37] pkgconfig_2.0.2    xml2_1.2.0         lubridate_1.7.4   
## [40] assertthat_0.2.0   rmarkdown_1.10     httr_1.3.1        
## [43] rstudioapi_0.7     R6_2.2.2           nlme_3.1-137      
## [46] compiler_3.5.1